In this notebook, we look at the oxidised peptides present in the raw data set and show that the merged data set changes depending on whether or not they are taken into account.


In [1]:
from msdas import *
%pylab inline


Couldn't import dot_parser, loading of dot files will not be possible.
Populating the interactive namespace from numpy and matplotlib

In [2]:
r = MassSpecReader(get_yeast_raw_data())


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/Yeast_all_raw.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- 200 rows have ambiguous psites and are removed
INFO:root:save data in attribute _ambiguous_psites_df
INFO:root:--------------------------------------------------
INFO:root:-- Removing 125 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?

In [3]:
r.df.shape


Out[3]:
(8570, 115)

In [4]:
r.plot_phospho_stats()



In [5]:
# which rows contain Oxidation in their sequence?
df = r.df[r.df.Sequence_Phospho.apply(lambda x: "Oxidation" in x)]
# we can build a new MassSpecReader instance from this dataframe:
oxidation = MassSpecReader(df)


INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?
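
The selection in the cell above relies on a Python-level substring test, which raises a TypeError if Sequence_Phospho ever holds a missing value. As a hedged alternative, the sketch below uses the vectorised pandas string accessor on a toy dataframe. The toy rows (including the missing entry) are illustrative assumptions and not taken from the yeast data; only the column names come from the notebook.

```python
import numpy as np
import pandas as pd

# Toy stand-in for r.df; only the Sequence_Phospho column matters here.
toy = pd.DataFrame({
    "Protein": ["STE12", "STE12", "HOG1"],
    "Sequence_Phospho": [
        "LVS(Phospho)PSDPTSYM(Oxidation)K",
        "LVS(Phospho)PSDPTSYMK",
        np.nan,  # hypothetical missing entry
    ],
})

# Literal substring match (regex=False); missing values count as "no oxidation".
mask = toy["Sequence_Phospho"].str.contains("Oxidation", regex=False, na=False)
oxidised = toy[mask]
print(oxidised)
```

The resulting dataframe could then be passed to MassSpecReader in the same way as df in the cell above.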

In [6]:
oxidation.df.shape


Out[6]:
(415, 115)

In [7]:
oxidation.plot_phospho_stats()
# it looks representative of the full data set (see figures above)



In [8]:
# similarly for the number of NAs
clf()
r.get_na_count().hist(normed=True, alpha=0.5)
oxidation.get_na_count().hist(normed=True, alpha=0.5, color="green")
# Here we compare the distribution of the number of NAs in both data sets


Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x8b3c350>
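
To read the two histograms, it may help to recall what is being compared: a count of missing values, turned into a normalised distribution so that data sets of very different sizes (8570 and 415 rows here) share one scale. The sketch below uses only the standard library and toy rows. It assumes that get_na_count() counts NAs per row, which the notebook does not state, and the helper names na_count and relative_frequencies are hypothetical. It computes relative frequencies, which is in the same spirit as normed=True but not the same quantity as a matplotlib density.

```python
import math
from collections import Counter

def na_count(row):
    """Count missing values (None or NaN) in one row of measurements."""
    return sum(
        1 for v in row
        if v is None or (isinstance(v, float) and math.isnan(v))
    )

def relative_frequencies(counts):
    """Map each NA count to the fraction of rows that have it."""
    total = len(counts)
    return {k: n / total for k, n in sorted(Counter(counts).items())}

nan = float("nan")
# Toy measurement rows, not taken from the yeast data.
rows = [
    [0.1, 0.2, 0.3],
    [nan, 0.2, nan],
    [nan, nan, 0.3],
    [0.1, nan, 0.3],
]
counts = [na_count(row) for row in rows]
freqs = relative_frequencies(counts)
print(counts, freqs)
```

Because each distribution sums to one, a subset can be compared with the full data set regardless of how many rows each contains.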

Let us figure out if some proteins in the small data set have peptides with oxidation:


In [9]:
y = MassSpecReader(get_yeast_small_data())


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/YEAST_small_all.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- Removing 0 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost

In [10]:
proteins = list(set(y.df.Protein))

In [11]:
filter_proteins  = oxidation.df.Protein.apply(lambda x: x in proteins)
subdf = oxidation.df[filter_proteins].Protein
found = list(set(subdf.values))

In [12]:
found


Out[12]:
['STE20', 'RCK2', 'HOG1', 'STE12', 'FUS3']
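
The three cells above amount to a set intersection between the protein names of the small data set and those of the oxidised peptides. A minimal sketch with plain Python sets follows; the two input lists are toy values chosen for illustration, not the real contents of y.df.Protein or oxidation.df.Protein.

```python
# Toy protein lists; duplicates are allowed, as in a dataframe column.
small_proteins = ["STE12", "HOG1", "FUS3", "PBS2"]
oxidised_proteins = ["STE12", "STE12", "HOG1", "ABC1"]

# Proteins present in both lists, sorted for a reproducible order.
found_toy = sorted(set(small_proteins) & set(oxidised_proteins))
print(found_toy)
```

Sorting the result avoids the arbitrary ordering that list(set(...)) produces.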

Effect of oxidation on the merging from the raw data to the small data set

We will look at the STE12 case found in the list above.


In [13]:
Y = replicates.ReplicatesYeast(get_yeast_raw_data(), verbose=True, cleanup=True)
Y.normalise()


INFO:root:Reading /home/cokelaer/Work/github/msdas/share/data/Yeast_all_raw.csv
INFO:root:Renaming psites with ^ character
INFO:root:Replacing zeros with NAs
INFO:root:-- 200 rows have ambiguous psites and are removed
INFO:root:save data in attribute _ambiguous_psites_df
INFO:root:--------------------------------------------------
INFO:root:-- Removing 125 rows with ambigous protein names:
INFO:root:--------------------------------------------------
WARNING:root:Rebuilding identifier in the dataframe. MERGED prefixes will be lost
WARNING:root:Identifiers are not unique. Have you called merge_peptides() ?

In [14]:
clf(); 
res1 = y.plot_timeseries("STE12_S400")
res2 = Y.plot_timeseries("STE12_S400", color="g", markersize=5)


WARNING:root:More than 1 row found. Consider calling merging_ambiguous_peptides method

Here, the data from the small data set are shown in red.
In green are the two rows of the raw data that correspond to peptide STE12_S400: one with the oxidation tag and one without. Both rows describe the same peptide (see next cell).
One of the green series matches the small data set exactly, which shows that peptides with oxidation were removed when the small data set was built, as confirmed by looking at the data set.


In [15]:
Y['STE12_S400']


Out[15]:
      Protein  Sequence      Psite  Sequence_Phospho                  a0_t0     a0_t0.1   a0_t0.2   a0_t1     a0_t1.1   a0_t1.2   ...  a45_t10.2  a45_t20   a45_t20.1  a45_t20.2  a45_t45   a45_t45.1  a45_t45.2  Entry   Entry_name   Identifier
2881  STE12    LVSPSDPTSYMK  S400   LVS(Phospho)PSDPTSYM(Oxidation)K  NaN       NaN       NaN       NaN       NaN       0.000077  ...  NaN        NaN       NaN        NaN        NaN       NaN        NaN        P13574  STE12_YEAST  STE12_S400
2882  STE12    LVSPSDPTSYMK  S400   LVS(Phospho)PSDPTSYMK             0.000254  0.000207  0.000242  0.000186  0.000215  0.000179  ...  0.000092   0.000118  0.000125   NaN        NaN       0.000197   0.000109   P13574  STE12_YEAST  STE12_S400

2 rows × 115 columns
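
The two rows above differ only by the (Oxidation) tag in Sequence_Phospho. One way to check this programmatically is to strip that tag and compare the strings. The sketch below does so with the standard re module; the helper names are hypothetical and not part of msdas.

```python
import re

def strip_oxidation(sequence_phospho):
    """Remove every (Oxidation) tag, keeping other tags such as (Phospho)."""
    return sequence_phospho.replace("(Oxidation)", "")

def strip_all_tags(sequence_phospho):
    """Remove every parenthesised modification tag to get the bare sequence."""
    return re.sub(r"\([^)]*\)", "", sequence_phospho)

with_oxidation = "LVS(Phospho)PSDPTSYM(Oxidation)K"
without_oxidation = "LVS(Phospho)PSDPTSYMK"

# Same phosphopeptide once the oxidation tag is ignored.
same_phosphopeptide = strip_oxidation(with_oxidation) == without_oxidation
# Both reduce to the value found in the Sequence column.
bare = strip_all_tags(with_oxidation)
print(same_phosphopeptide, bare)
```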

